Intro to Bayesian Inference

…and Likelihoods!
February 7, 2024

Designing a generative data model

Example: Proportions

  • To start, we’ll be interested in the population parameter known as a proportion.
  • For proportions, the random variable (data) of interest is categorical.
  • More specifically, we are interested in the proportion of times in the population that a particular level of a categorical variable occurs.

Example

  • Population: All $1 US bills
  • Variable: Measurable cocaine? (Levels: Yes/No)
  • Parameter: Proportion of bills with measurable cocaine.
  • Sample: 50 $1 bills [BTW, in actual study, 46 had measurable cocaine]

Other examples

  • Proportion of right-handed toads of a certain species
  • Proportion of individuals with a heterozygote genotype (assume two alleles at a particular locus)
  • Proportion of individuals with hazel eyes
  • Proportion of blue marbles in a bag of 4 marbles that are either white or blue
  • Proportion of Earth’s surface that is water

Preliminaries - What variable?

Suppose I have a population of size \(N\) and a categorical variable \(Y\) (e.g. \(Y\) could be genotype with levels {aa, Aa, AA}).

For proportions, we really need a categorical variable with only two levels, which would be

  • “Success = has level of interest”
  • “Failure = does not have level of interest”.

Preliminaries - What variable?

Suppose I have a population of size \(N\) and a categorical variable \(Y\) (e.g. \(Y\) could be genotype with levels {aa, Aa, AA}).

For example, with the genotype case, we could be interested in the proportion of heterozygotes in a population, so we would have

  • “Success = Aa”
  • “Failure = not Aa = {aa, AA}”

Preliminaries - What parameter?

Given the categorical variable with levels “Success” and “Failure”, the proportion of successes in the population would be denoted by

\[p = \frac{\mathrm{Number \ of \ successes \ in \ population}}{\mathrm{Total \ population \ size}} = \frac{X}{N}\]

If we have a sample of size \(n\), then we would have a sample estimate of this proportion \(p\) given by

\[\hat{p} = \frac{\mathrm{Number \ of \ successes \ in \ sample}}{\mathrm{Total \ sample \ size}} = \frac{x}{n}\]
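For instance, with the dollar-bill study above (46 of the 50 sampled bills had measurable cocaine), the sample proportion is computed directly in R:

```r
x <- 46        # number of successes in the sample (bills with measurable cocaine)
n <- 50        # sample size
(p_hat <- x / n)
```

    [1] 0.92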

Data story: Globe tossing

Proportion of Earth’s surface that is water

  • Imagine having a ball-sized globe where we know that 70% of the surface is water.
  • Toss the globe, spinning it enough that every point on its surface is equally likely to land under your thumb when you catch it.
  • At the point where your thumb touches the globe, record whether that point is water or land.
  • Repeat \(n\) number of times.

Data story: Globe tossing

Proportion of Earth’s surface that is water

Example data with 7 tosses: W W L L W L W

This process is known as a binomial process, and the distribution of the number of waters expected in \(n\) tosses is given by the binomial distribution.

Data story: Globe tossing

Definition: The binomial distribution provides the probability distribution for the number of “successes” in a fixed number of independent trials, when the probability of success is the same in each trial.

Properties:

  • Number of trials is fixed (\(n\)) [sample size]
  • Probability of success in each trial (\(p\)) is the same in every trial [proportion of successes in population]
  • Separate trials are independent [random sample]

Data story: Binomial distribution

If we have \(n\) trials, and the probability of success in each trial is \(p\), we have \[\mathrm{Pr[}x \mathrm{ \ successes]} = \binom{n}{x}p^{x}(1-p)^{n-x},\] where \[\binom{n}{x} = \frac{n!}{x!(n-x)!},\]

and \(n! = n\times(n-1)\times(n-2)\cdots 2\times 1.\)
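In R, the built-in choose function computes this coefficient, and we can confirm it agrees with the factorial formula (using \(n=7\), \(x=4\) as example values):

```r
n <- 7; x <- 4
choose(n, x)                                       # binomial coefficient, n choose x
factorial(n) / (factorial(x) * factorial(n - x))   # same value: 35
```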

Data story: Binomial distribution

To figure out Pr[\(x\) successes], first ask

Question: “What are all different outcomes of \(x\) successes in \(n\) trials?”

Data story: Binomial distribution

Example: Suppose \(n=3\) and \(x=2\).

\[2 \ \mathrm{successes} = \{SSF, SFS, FSS\}\]

\[\mathrm{Pr}[SSF] = \mathrm{Pr}[S]\times \mathrm{Pr}[S]\times \mathrm{Pr}[F] = p^2(1-p)\]

\[\mathrm{Pr}[SFS] = \mathrm{Pr}[S]\times \mathrm{Pr}[F]\times \mathrm{Pr}[S] = p^2(1-p)\]

\[\mathrm{Pr}[FSS] = \mathrm{Pr}[F]\times \mathrm{Pr}[S]\times \mathrm{Pr}[S] = p^2(1-p)\]

Data story: Binomial distribution

Example: Suppose \(n=3\) and \(x=2\).

\[2 \ \mathrm{successes} = \{SSF, SFS, FSS\}\]

\[\mathrm{Pr}[SSF] = \mathrm{Pr}[SFS] = \mathrm{Pr}[FSS] = p^2(1-p) = p^x(1-p)^{n-x}\]

How many ways are there to have 2 successes in 3 trials? “3 choose 2”, which equals 3!

\[\mathrm{Pr[2 \ successes]} = \binom{3}{2}p^2(1-p)\]
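This enumeration can be checked numerically: each of the three orderings has probability \(p^2(1-p)\), and their total matches dbinom (here with \(p=0.7\) as an example value):

```r
p <- 0.7
3 * p^2 * (1 - p)              # sum over the three orderings SSF, SFS, FSS
dbinom(2, size = 3, prob = p)  # built-in binomial probability: same value
```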

Data story: Binomial distribution

To get the values of the probability distribution, use the dbinom function. Supposing \(n=7\) and \(p=0.7\), we have:

(pdist <- dbinom(x=0:7, size=7, prob=0.7))
[1] 0.0002187 0.0035721 0.0250047 0.0972405 0.2268945 0.3176523 0.2470629
[8] 0.0823543


sum(pdist)
[1] 1

The d in dbinom stands for density (here, the probability mass function).

Data story: Binomial distribution

Question: Given \(p=0.3\) and \(n=20\), what is Pr[6 successes]? (Write out using notation)

Answer: \[\mathrm{Pr[}6 \mathrm{ \ successes]} = \binom{20}{6}0.3^{6}\times 0.7^{14}.\]

Using R and dbinom,

(ans <- dbinom(x=6, size=20, prob=0.3))
[1] 0.191639

Thus, Pr[6 successes] = 0.191639.
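Writing the formula out by hand gives the same number, confirming that dbinom implements the binomial formula:

```r
choose(20, 6) * 0.3^6 * 0.7^14        # direct use of the binomial formula
dbinom(x = 6, size = 20, prob = 0.3)  # built-in function: same value
```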

Data story: Binomial distribution

Let’s plot the distribution:

barplot(pdist, 
        names.arg=0:7, 
        col="firebrick", 
        xlab="X (Number of successes)", 
        ylab="Probability")

Data story: Binomial distribution

Let’s plot the distribution:

[Bar plot of the binomial distribution for \(n=7\), \(p=0.7\): probability against X (number of successes)]

Data story: Binomial distribution

Question: How do I generate data from a probability distribution?

Data story: Binomial distribution

Random draw from a distribution: Use rbinom.

n <- 7
p <- 0.7
(num_water <- rbinom(1, size = n, prob = p))
[1] 4
(num_land <- n-num_water)
[1] 3

Convert to “raw” data.

(data <- c(rep("W", num_water), rep("L", num_land)))
[1] "W" "W" "W" "W" "L" "L" "L"

Randomize.

(data <- sample(data))
[1] "W" "L" "W" "W" "L" "L" "W"
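As a sanity check (not in the original slides), with many tosses the sample proportion of “W” in data generated this way settles near the true \(p = 0.7\):

```r
set.seed(1)                            # make the random draw reproducible
n <- 1e5
num_water <- rbinom(1, size = n, prob = 0.7)
data <- sample(c(rep("W", num_water), rep("L", n - num_water)))
mean(data == "W")                      # sample proportion: close to 0.7
```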

Data story: Binomial distribution

Reverse: Inferring model parameters from data

Likelihood

\[ \mathrm{L}(\color{blue}{p}) = \binom{\color{red}{n}}{\color{red}{x}}\color{blue}{p}^{\color{red}{x}}(1-\color{blue}{p})^{\color{red}{n}-\color{red}{x}} \]

  • NOT a probability distribution
  • \(L(p)=\) relative likelihood of data for different \(p\)
  • \(L(data|p)\)

Likelihood

Maximum Likelihood Estimation

\[ \mathrm{L}(\color{blue}{p}) = \binom{\color{red}{n}}{\color{red}{x}}\color{blue}{p}^{\color{red}{x}}(1-\color{blue}{p})^{\color{red}{n}-\color{red}{x}} \]

  • Question: What is the most likely proportion given the data?

  • Answer: Maximize the likelihood!

Maximum Likelihood Estimation

Definition: The maximum likelihood estimate is the parameter value \(\hat{p}\) that maximizes the likelihood function; it satisfies \[ \frac{dL}{dp}\Big|_{p=\hat{p}} = 0, \] which for the binomial likelihood yields

\[ \hat{p} = \frac{x}{n} \]
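A quick numerical check with the globe-tossing data (\(x=4\) waters in \(n=7\) tosses): evaluating \(L(p)\) over a fine grid puts the maximum at the grid point nearest \(x/n = 4/7 \approx 0.571\).

```r
x <- 4; n <- 7
p_grid <- seq(0, 1, length.out = 1001)
L <- dbinom(x, size = n, prob = p_grid)  # likelihood at each candidate p
p_grid[which.max(L)]                     # approximately 4/7
```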

Summary of our model

Bayesian Inference

Bayesian Inference

  • How do we account for uncertainty in the maximum likelihood estimate?
  • How do we account for prior knowledge about the parameters?
  • How can we update our uncertainty when we obtain new data/knowledge?

Understanding Likelihood

We’ll work through McElreath’s slides 8-22.

Likelihood to plausibility

McElreath Slides 23-26

p <- c(0, 0.25, 0.5, 0.75, 1) # Parameters
L <- dbinom(2, size=3, prob=p) # Likelihood (data known; parameters unknown)
sum(L) # Does it sum to one?
[1] 0.9375

We could interpret the likelihood as a probability distribution over the parameter, as long as we rescale it to sum to one!

(p_dist <- L/sum(L))
[1] 0.00 0.15 0.40 0.45 0.00

Same as garden of forking paths!

Bayesian Updating

What do we do when new data arrives?

Bayesian Updating!!

McElreath Slides 27-31
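The updating in those slides can be sketched in R: start from a flat prior on a grid and, after each toss, multiply by the likelihood of that single observation, so each posterior becomes the prior for the next toss. This is an illustrative sketch using the example sequence W W L L W L W from earlier.

```r
p_grid <- seq(0, 1, length.out = 100)
posterior <- rep(1 / 100, 100)                   # flat prior over the grid
tosses <- c("W", "W", "L", "L", "W", "L", "W")
for (toss in tosses) {
  like <- dbinom(as.integer(toss == "W"), size = 1, prob = p_grid)  # one toss
  posterior <- posterior * like                  # numerator of Bayes' rule
  posterior <- posterior / sum(posterior)        # renormalize to sum to one
}
p_grid[which.max(posterior)]                     # posterior mode, near 4/7
```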

Bayesian Inference

McElreath Slides 32-43

How to implement in R

\[ P(p\,|\,\mathrm{data}) = \frac{P(\mathrm{data}\,|\,p)P(p)}{P(\mathrm{data})} \]

Grid approximation

Data: Suppose we tossed 7 times and got 4 water.

p_grid <- seq(from=0, to=1, length.out=20) # define grid
prior <- rep(1, 20) # define prior
likelihood <- dbinom(4, size=7, prob=p_grid) # compute likelihood at each value in grid
unstd.posterior <- likelihood * prior # compute product of likelihood and prior
posterior <- unstd.posterior / sum(unstd.posterior) # standardize the posterior, so it sums to 1
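Two quick sanity checks on this block (the grid code is repeated here so the chunk runs on its own): the standardized posterior sums to one, and with a flat prior its mode lands at the grid point nearest the MLE \(4/7\).

```r
p_grid <- seq(from = 0, to = 1, length.out = 20)
prior <- rep(1, 20)
likelihood <- dbinom(4, size = 7, prob = p_grid)
posterior <- likelihood * prior / sum(likelihood * prior)

sum(posterior)                  # 1
p_grid[which.max(posterior)]    # 11/19, the grid point nearest 4/7
```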

How to implement in R

Grid approximation (Line 1)

Data: Suppose we tossed 7 times and got 4 water.

p_grid <- seq(from=0, to=1, length.out=20) # define grid
prior <- rep(1, 20) # define prior
likelihood <- dbinom(4, size=7, prob=p_grid) # compute likelihood at each value in grid
unstd.posterior <- likelihood * prior # compute product of likelihood and prior
posterior <- unstd.posterior / sum(unstd.posterior) # standardize the posterior, so it sums to 1

How to implement in R

Grid approximation (Line 2)

Data: Suppose we tossed 7 times and got 4 water.

p_grid <- seq(from=0, to=1, length.out=20) # define grid
prior <- rep(1, 20) # define prior
likelihood <- dbinom(4, size=7, prob=p_grid) # compute likelihood at each value in grid
unstd.posterior <- likelihood * prior # compute product of likelihood and prior
posterior <- unstd.posterior / sum(unstd.posterior) # standardize the posterior, so it sums to 1

How to implement in R

Grid approximation (Line 3)

Data: Suppose we tossed 7 times and got 4 water.

p_grid <- seq(from=0, to=1, length.out=20) # define grid
prior <- rep(1, 20) # define prior
likelihood <- dbinom(4, size=7, prob=p_grid) # compute likelihood at each value in grid
unstd.posterior <- likelihood * prior # compute product of likelihood and prior
posterior <- unstd.posterior / sum(unstd.posterior) # standardize the posterior, so it sums to 1

How to implement in R

Grid approximation (Line 4)

Data: Suppose we tossed 7 times and got 4 water.

p_grid <- seq(from=0, to=1, length.out=20) # define grid
prior <- rep(1, 20) # define prior
likelihood <- dbinom(4, size=7, prob=p_grid) # compute likelihood at each value in grid
unstd.posterior <- likelihood * prior # compute product of likelihood and prior
posterior <- unstd.posterior / sum(unstd.posterior) # standardize the posterior, so it sums to 1

How to implement in R

Grid approximation (Line 5)

Data: Suppose we tossed 7 times and got 4 water.

p_grid <- seq(from=0, to=1, length.out=20) # define grid
prior <- rep(1, 20) # define prior
likelihood <- dbinom(4, size=7, prob=p_grid) # compute likelihood at each value in grid
unstd.posterior <- likelihood * prior # compute product of likelihood and prior
posterior <- unstd.posterior / sum(unstd.posterior) # standardize the posterior, so it sums to 1

How to implement in R

Grid approximation

Data: Suppose we tossed 7 times and got 4 water.

plot(p_grid, posterior, type="b", xlab="probability of water", ylab="posterior probability")

Different priors

prior <- ifelse( p_grid < 0.5, 0, 1 )   # step prior: zero mass below 0.5

prior <- exp( -5*abs( p_grid - 0.5 ) )  # prior peaked at 0.5
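Either prior can be dropped into the grid computation from before; for example, the step prior (zero below 0.5) forces zero posterior mass below 0.5. A sketch reusing the earlier grid code:

```r
p_grid <- seq(from = 0, to = 1, length.out = 20)
likelihood <- dbinom(4, size = 7, prob = p_grid)

prior <- ifelse(p_grid < 0.5, 0, 1)             # step prior
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)

sum(posterior[p_grid < 0.5])                    # 0: no posterior mass below 0.5
```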

Summary

  • Develop a probability model via some process that can generate data (\(Pr[\mathrm{data}|p]\); \(p\) known, data unknown)
  • For inference (\(p\) is unknown, data is known) the probability model becomes the likelihood \(L(\mathrm{data}|p)\) of the data given \(p\)
  • Bayesian inference starts from prior knowledge and updates the model with data, yielding a posterior distribution over plausible parameter values

\[ P[p|\mathrm{data}] = \frac{L(\mathrm{data}|p)P[p]}{P[\mathrm{data}]} \]